Abstract: A document page may contain words and numerals in different regional scripts alongside English and/or the national language. Documents in multilingual countries or in border areas, in particular, often mix scripts in this way to convey information to a wide audience. A monolingual OCR system fails to recognize words written in other scripts, so script identification becomes essential in such cases. Script identification is one of the challenging steps in an Optical Character Recognition (OCR) system for multi-script documents. In this work we propose a word-wise script identifier covering all the South Indian languages. The proposed method uses features based on morphological operations (dilation, erosion, and reconstruction), and a nearest neighbor classifier assigns the script. The method proved robust when tested on 600 word document images, with an overall accuracy of 98.1%.
Keywords: OCR, script identification, morphological reconstruction, multilingual documents, multi-script documents, NN classifier.
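
As a rough illustration of the pipeline summarized in the abstract (morphological features followed by a nearest neighbor classifier), the following minimal sketch operates on a binarized word image stored as a list of 0/1 rows. The 3x3 structuring element, the two pixel-ratio features, and the labels `script_A` and `script_B` are assumptions made for illustration only; the abstract does not specify the actual feature set or structuring elements.

```python
from math import dist


def dilate(img):
    """Binary dilation with a 3x3 square structuring element (assumed)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # A pixel is set if any pixel in its 3x3 neighbourhood is set.
            out[y][x] = int(any(
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ))
    return out


def erode(img):
    """Binary erosion with a 3x3 square; outside the image counts as background."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # A pixel survives only if its whole 3x3 neighbourhood is set.
            out[y][x] = int(all(
                0 <= j < h and 0 <= i < w and img[j][i]
                for j in range(y - 1, y + 2)
                for i in range(x - 1, x + 2)
            ))
    return out


def reconstruct(marker, mask):
    """Morphological reconstruction by dilation: grow marker inside mask until stable."""
    current = [[m & k for m, k in zip(mr, kr)] for mr, kr in zip(marker, mask)]
    while True:
        grown = dilate(current)
        nxt = [[g & k for g, k in zip(gr, kr)] for gr, kr in zip(grown, mask)]
        if nxt == current:
            return current
        current = nxt


def features(img):
    """Hypothetical feature vector: pixel ratios after erosion and after
    opening by reconstruction, relative to the original foreground."""
    total = sum(map(sum, img)) or 1
    eroded = erode(img)
    opened = reconstruct(eroded, img)
    return (sum(map(sum, eroded)) / total, sum(map(sum, opened)) / total)


def nearest_neighbor(sample, training):
    """1-NN: return the label of the training vector closest in Euclidean distance."""
    return min(training, key=lambda pair: dist(pair[0], sample))[1]


# Toy word image: a solid 3x3 blob plus one isolated pixel.
word = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

# Hypothetical labelled feature vectors standing in for a training set.
training = [((0.1, 0.8), "script_A"), ((0.5, 0.2), "script_B")]
predicted = nearest_neighbor(features(word), training)
```

In this toy case erosion keeps only the blob's center pixel, and reconstruction restores the full blob while discarding the isolated pixel, which is the kind of shape-preserving filtering that makes reconstruction useful as a feature base. A real system would compute such features from scanned word images and compare them against vectors from labelled samples of each script.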